 xgboost classifier


Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

Salem, Sepehr, Esfahani, M. Moein, Liu, Jingyu, Calhoun, Vince

arXiv.org Artificial Intelligence

Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.


Exploring Molecular Odor Taxonomies for Structure-based Odor Predictions using Machine Learning

Sajan, Akshay, Sluis, Stijn, Haydarlou, Reza, Abeln, Sanne, Lisena, Pasquale, Troncy, Raphael, Verbeek, Caro, Leemans, Inger, Mouhib, Halima

arXiv.org Artificial Intelligence

One of the key challenges in predicting odor from molecular structure is unarguably our limited understanding of the odor space and the complexity of the underlying structure-odor relationships. Here, we show that the predictive performance of machine learning models for structure-based odor predictions can be improved using both an expert and a data-driven odor taxonomy. The expert taxonomy is based on semantic and perceptual similarities, while the data-driven taxonomy is based on clustering co-occurrence patterns of odor descriptors directly from the prepared dataset. Both taxonomies improve the predictions of different machine learning models and outperform random groupings of descriptors that do not reflect existing relations between odor descriptors. We assess the quality of both taxonomies through their predictive performance across different odor classes and perform an in-depth error analysis highlighting the complexity of odor-structure relationships and identifying potential inconsistencies within the taxonomies by showcasing pear odorants used in perfumery. The data-driven taxonomy allows us to critically evaluate our expert taxonomy and better understand the molecular odor space. Both taxonomies as well as a full dataset are made available to the community, providing a stepping stone for a future community-driven exploration of the molecular basis of smell. In addition, we provide a detailed multi-layer expert taxonomy including a total of 777 different descriptors from the Pyrfume repository.


Outperformance Score: A Universal Standardization Method for Confusion-Matrix-Based Classification Performance Metrics

Zhao, Ningsheng, Bui, Trang, Yu, Jia Yuan, Dzieciolowski, Krzysztof

arXiv.org Machine Learning

Many classification performance metrics exist, each suited to a specific application. However, these metrics often differ in scale and can exhibit varying sensitivity to class imbalance rates in the test set. As a result, it is difficult to use the nominal values of these metrics to interpret and evaluate classification performances, especially when imbalance rates vary. To address this problem, we introduce the outperformance score function, a universal standardization method for confusion-matrix-based classification performance (CMBCP) metrics. It maps any given metric to a common scale of $[0,1]$, while providing a clear and consistent interpretation. Specifically, the outperformance score represents the percentile rank of the observed classification performance within a reference distribution of possible performances. This unified framework enables meaningful comparison and monitoring of classification performance across test sets with differing imbalance rates. We illustrate how the outperformance scores can be applied to a variety of commonly used classification performance metrics and demonstrate the robustness of our method through experiments on real-world datasets spanning multiple classification applications.
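
The percentile-rank idea in the abstract above can be illustrated with a small sketch. The abstract does not say how the reference distribution is constructed, so the one used here is purely an assumption for illustration: F1 scores of hypothetical classifiers whose true-positive and false-positive rates are drawn uniformly from [0, 1] on a test set with the given class counts. The function names (`f1_score`, `reference_f1_samples`, `percentile_rank`) are likewise hypothetical and not taken from the paper.

```python
import random


def f1_score(tp, fp, fn):
    """F1 computed from confusion-matrix counts (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def reference_f1_samples(n_pos, n_neg, n_samples=2000, seed=0):
    """ASSUMED reference distribution (not the paper's construction):
    F1 of classifiers with TPR and FPR drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        tpr, fpr = rng.random(), rng.random()
        tp = round(tpr * n_pos)
        fp = round(fpr * n_neg)
        fn = n_pos - tp
        samples.append(f1_score(tp, fp, fn))
    return samples


def percentile_rank(observed, reference):
    """Fraction of reference values <= observed; always lies in [0, 1]."""
    return sum(v <= observed for v in reference) / len(reference)


# The same nominal F1 is ranked against two different imbalance rates.
observed = f1_score(tp=40, fp=10, fn=10)
balanced = percentile_rank(observed, reference_f1_samples(50, 50))
imbalanced = percentile_rank(observed, reference_f1_samples(50, 950))
```

Under this assumed reference, the same nominal F1 receives a higher rank on the heavily imbalanced test set than on the balanced one, which is the kind of imbalance-aware comparison the abstract describes.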


Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues

Cipollone, Daniele, Wang, Changjie, Scazzariello, Mariano, Ferlin, Simone, Izadi, Maliheh, Kostic, Dejan, Chiesa, Marco

arXiv.org Artificial Intelligence

In today's digital landscape, the importance of timely and accurate vulnerability detection has significantly increased. This paper presents a novel approach that leverages transformer-based models and machine learning techniques to automate the identification of software vulnerabilities by analyzing GitHub issues. We introduce a new dataset specifically designed for classifying GitHub issues relevant to vulnerability detection. We then examine various classification techniques to determine their effectiveness. The results demonstrate the potential of this approach for real-world application in early vulnerability detection, which could substantially reduce the window of exploitation for software vulnerabilities. This research makes a key contribution to the field by providing a scalable and computationally efficient framework for automated detection, enabling the prevention of compromised software usage before official notifications. This work has the potential to enhance the security of open-source software ecosystems.


SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

Consoli, Bernardo, Wu, Xizhi, Wang, Song, Zhao, Xinyu, Wang, Yanshan, Rousseau, Justin, Hartvigsen, Tom, Shen, Li, Wu, Huanmei, Peng, Yifan, Long, Qi, Chen, Tianlong, Ding, Ying

arXiv.org Artificial Intelligence

Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying on extensive medical annotations or costly human intervention. It achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the strengths of both, ensuring high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores. Testing across three distinct datasets has confirmed its robustness and accuracy. This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.


PSO Fuzzy XGBoost Classifier Boosted with Neural Gas Features on EEG Signals in Emotion Recognition

Mousavi, Seyed Muhammad Hossein

arXiv.org Artificial Intelligence

Emotion recognition is the technology-driven process of identifying and categorizing human emotions from various data sources, such as facial expressions, voice patterns, body motion, and physiological signals, such as EEG. These physiological indicators, though rich in data, present challenges due to their complexity and variability, necessitating sophisticated feature selection and extraction methods. The Neural-Gas Network (NGN), an unsupervised learning algorithm, effectively adapts to input spaces without predefined grid structures, improving feature extraction from physiological data. Furthermore, the incorporation of fuzzy logic enables the handling of fuzzy data by introducing reasoning that mimics human decision-making. The combination of Particle Swarm Optimization (PSO) with XGBoost aids in optimizing model performance through efficient hyperparameter tuning and decision process optimization. This study explores the integration of NGN, XGBoost, PSO, and fuzzy logic to enhance emotion recognition using physiological signals. Our research addresses three critical questions concerning the improvement of XGBoost with PSO and fuzzy logic, NGN's effectiveness in feature selection, and the performance comparison of the PSO-fuzzy XGBoost classifier with standard benchmarks. The results indicate that our methodologies enhance the accuracy of emotion recognition systems and outperform other feature selection techniques using the majority of classifiers, offering significant implications for both theoretical advancement and practical application in emotion recognition technology.


BrainMetDetect: Predicting Primary Tumor from Brain Metastasis MRI Data Using Radiomic Features and Machine Learning Algorithms

Sadeghsalehi, Hamidreza

arXiv.org Artificial Intelligence

Objective: Brain metastases (BMs) are common in cancer patients and determining the primary tumor site is crucial for effective treatment. This study aims to predict the primary tumor site from BM MRI data using radiomic features and advanced machine learning algorithms. Methods: We utilized a comprehensive dataset from Ocana-Tienda et al. (2023) comprising MRI and clinical data from 75 patients with BMs. Radiomic features were extracted from post-contrast T1-weighted MRI sequences. Feature selection was performed using the GINI index, and data normalization was applied to ensure consistent scaling. We developed and evaluated Random Forest and XGBoost classifiers, both with and without hyperparameter optimization using the FOX (Fox optimizer) algorithm. Model interpretability was enhanced using SHAP (SHapley Additive exPlanations) values. Results: The baseline Random Forest model achieved an accuracy of 0.85, which improved to 0.93 with FOX optimization. The XGBoost model showed an initial accuracy of 0.96, increasing to 0.99 after optimization. SHAP analysis revealed the most influential radiomic features contributing to the models' predictions. The FOX-optimized XGBoost model exhibited the best performance with a precision, recall, and F1-score of 0.99. Conclusion: This study demonstrates the effectiveness of using radiomic features and machine learning to predict primary tumor sites from BM MRI data. The FOX optimization algorithm significantly enhanced model performance, and SHAP provided valuable insights into feature importance. These findings highlight the potential of integrating radiomics and machine learning into clinical practice for improved diagnostic accuracy and personalized treatment planning.


RE-GrievanceAssist: Enhancing Customer Experience through ML-Powered Complaint Management

C, Venkatesh, Oberoi, Harshit, Pandey, Anurag Kumar, Goyal, Anil, Sikka, Nikhil

arXiv.org Artificial Intelligence

In recent years, digital platform companies have faced increasing challenges in managing customer complaints, driven by widespread consumer adoption. This paper introduces an end-to-end pipeline, named RE-GrievanceAssist, designed specifically for real estate customer complaint management. The pipeline consists of three key components: i) a response/no-response ML model using TF-IDF vectorization and an XGBoost classifier; ii) a user type classifier using a fastText classifier; iii) an issue/sub-issue classifier using TF-IDF vectorization and an XGBoost classifier. Finally, it has been deployed as a batch job in Databricks, resulting in a remarkable 40% reduction in overall manual effort with a monthly cost reduction of Rs 1,50,000 since August 2023.

GWPT: A Green Word-Embedding-based POS Tagger

Wei, Chengwei, Pang, Runqi, Kuo, C. -C. Jay

arXiv.org Artificial Intelligence

As a fundamental tool for natural language processing (NLP), the part-of-speech (POS) tagger assigns the POS label to each word in a sentence. A novel lightweight POS tagger based on word embeddings is proposed and named GWPT (green word-embedding-based POS tagger) in this work. Following the green learning (GL) methodology, GWPT contains three modules in cascade: 1) representation learning, 2) feature learning, and 3) decision learning modules. The main novelty of GWPT lies in representation learning. It uses non-contextual or contextual word embeddings, partitions embedding dimension indices into low-, medium-, and high-frequency sets, and represents them with different N-grams. Experimental results show that GWPT offers state-of-the-art accuracies with fewer model parameters and significantly lower computational complexity in both training and inference compared with deep-learning-based methods.


Enhancing Edge Intelligence with Highly Discriminant LNT Features

Wang, Xinyu, Mishra, Vinod K., Kuo, C. -C. Jay

arXiv.org Artificial Intelligence

AI algorithms at the edge demand smaller model sizes and lower computational complexity. To achieve these objectives, we adopt a green learning (GL) paradigm rather than the deep learning paradigm. GL has three modules: 1) unsupervised representation learning, 2) supervised feature learning, and 3) supervised decision learning. We focus on the second module in this work. In particular, we derive new discriminant features from proper linear combinations of input features, denoted by $x$, obtained in the first module. They are called complementary and raw features, respectively. Along this line, we present a novel supervised learning method to generate highly discriminant complementary features based on the least-squares normal transform (LNT). LNT consists of two steps. First, we convert a $C$-class classification problem to a binary classification problem. The two classes are assigned 0 and 1, respectively. Next, we formulate a least-squares regression problem from the $N$-dimensional ($N$-D) feature space to the 1-D output space, and solve the least-squares normal equation to obtain one $N$-D normal vector, denoted by $a_1$. Since one normal vector is yielded by one binary split, we can obtain $M$ normal vectors with $M$ splits. Then, $Ax$ is called an LNT of $x$, where the transform matrix $A \in \mathbb{R}^{M \times N}$ is formed by stacking $a_j^T$, $j = 1, \dots, M$, and the LNT, $Ax$, generates $M$ new features. The newly generated complementary features are shown to be more discriminant than the raw features. Experiments show that the classification performance can be improved by these new features.
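
The two LNT steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the binary splits are chosen, so the sketch labels a random non-trivial subset of classes as 1 for each split; it also centers the features and targets to absorb the intercept, and uses `np.linalg.lstsq` (which solves the same least-squares problem as the normal equation) for numerical stability. The function name `lnt_matrix` is hypothetical.

```python
import numpy as np


def lnt_matrix(X, y, n_splits, seed=0):
    """Return A with shape (M, N): one least-squares normal vector per split.

    ASSUMPTIONS (not from the abstract): splits are random class subsets,
    and features/targets are centered so no explicit intercept is needed.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    Xc = X - X.mean(axis=0)
    rows = []
    for _ in range(n_splits):
        # Step 1: turn the C-class problem into a binary (0/1) problem.
        k = rng.integers(1, len(classes))  # subset size in 1..C-1
        ones = rng.choice(classes, size=k, replace=False)
        t = np.isin(y, ones).astype(float)
        # Step 2: least-squares regression from N-D features to 1-D target.
        a, *_ = np.linalg.lstsq(Xc, t - t.mean(), rcond=None)
        rows.append(a)
    return np.vstack(rows)


# Toy usage on synthetic data: 300 samples, N = 5 features, C = 3 classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)
A = lnt_matrix(X, y, n_splits=4)  # M = 4 splits
features = X @ A.T                # row i holds A x for sample i
```

Each column of `features` is one new complementary feature; in practice these would be used alongside the raw features `X` in the decision-learning module.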